In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import scipy.stats as stats
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
#¶

PART-A¶

#¶

1. Please refer the table below to answer below questions¶

1.A. Refer above table and find the joint probability of the people who planned to purchase and actually placed an order¶

In [2]:
400/2000
Out[2]:
0.2

P(planned_to_purchase and actually_placed)=planned_placed/total = 400/2000 = 0.2¶

1.B. Refer to the above table and find the joint probability of the people who planned to purchase and actually placed an order, given that people planned to purchase.¶

In [3]:
400/500
Out[3]:
0.8

P(planned_to_purchase and actually_placed | planned_to_purchase)=P(planned_to_purchase and actually_placed)/P(planned_to_purchase) = (400/2000)/(500/2000) = 0.2¶

#¶

2. An electrical manufacturing company conducts quality checks at specified periods on the products it manufactures. Historically, the failure rate for the manufactured item is 5%. Suppose a random sample of 10 manufactured items is selected. Answer the following questions¶

failure rate for the manufactured item is 5% (p) n=10

2.A. Probability that none of the items are defective?¶

In [4]:
stats.binom.pmf(k=0,n=10,p=.05)
Out[4]:
0.5987369392383787

Probability that none of the items are defective = 0.5987¶

2.B. Probability that exactly one of the items is defective?¶

In [5]:
stats.binom.pmf(k=1,n=10,p=.05)
Out[5]:
0.3151247048623047

Probability that exactly one of the items is defective = 0.3151¶

2.C. Probability that two or fewer of the items are defective?¶

In [6]:
stats.binom.cdf(k=2,n=10,p=.05)
Out[6]:
0.9884964426207031

Probability that two or fewer of the items are defective = 0.9884¶

2.D. Probability that three or more of the items are defective ?¶

In [7]:
1-stats.binom.cdf(k=2,n=10,p=.05)
Out[7]:
0.01150355737929687

Probability that three or more of the items are defective = 0.0115¶

#¶

3. A car salesman sells on an average 3 cars per week.¶

Mean sells is 3

3.A. What is Probability that in a given week he will sell some cars?¶

In [8]:
1-stats.poisson.pmf(0,3)
Out[8]:
0.950212931632136

Probability that in a given week he will sell some cars = 1-Probability that he sells 0 car for given week = 1-0.4978 = 0.9502¶

3.B. What is Probability that in a given week he will sell 2 or more but less than 5 cars?¶

In [9]:
stats.poisson.cdf(4,3)-stats.poisson.cdf(1,3)
Out[9]:
0.6161149710523164

Probability that in a given week he will sell 2 or more but less than 5 cars = P(x<=4) - P(x<=1) = 0.61611¶

3.C. Plot the poisson distribution function for cumulative probability of cars sold per-week vs number of cars sold per week.¶

In [10]:
k = np.arange(0,11)
poisson = stats.poisson.pmf(k,3)
# Plotting graph for 0-10 cars sold
plt.figure(figsize=(10,7))
plt.bar(k, poisson) # make a bar plot
plt.title("Poisson distribution function for average sells per week :3", fontsize=18) # set the title
plt.xlabel("Number of Cars Sold", fontsize=14) # set the x-axis label
plt.ylabel("Probability of cars sold per-week", fontsize=14) # set the y-axis label
plt.show() # display the plot
#¶

4. Accuracy in understanding orders for a speech based bot at a restaurant is important for the Company X which has designed, marketed and launched the product for a contactless delivery due to the COVID-19 pandemic. Recognition accuracy that measures the percentage of orders that are taken correctly is 86.8%. Suppose that you place an order with the bot and two friends of yours independently place orders with the same bot. Answer the following questions.¶

percentage of orders that are taken correctly is 86.8% (p) n=3

4.A. What is the probability that all three orders will be recognised correctly?¶

In [11]:
stats.binom.pmf(k=3,n=3,p=0.868)
Out[11]:
0.653972032

Probability that all three orders will be recognised correctly = P(x=3) = 0.6539¶

4.B. What is the probability that none of the three orders will be recognised correctly?¶

In [12]:
stats.binom.pmf(k=0,n=3,p=0.868)
Out[12]:
0.0022999680000000003

Probability that none of the three orders will be recognised correctly = P(x=0) = 0.0022¶

4.C. What is the probability that at least two of the three orders will be recognised correctly?¶

In [13]:
1-stats.binom.cdf(k=1,n=3,p=0.868)
Out[13]:
0.952327936

Probability that at least two of the three orders will be recognised correctly = P(x>=2) = 1-P(x<2) = 0.9523¶

#¶

5. Explain 1 real life industry scenario (other than the ones mentioned above) where you can use the concepts learnt in this module of Applied Statistics to get data driven business solution.¶

Scenario-1:¶

In Amazon, sellers are onboarded on the basis some criteria that needs to be validated. As per the validation performed, sellers are notified with the acceptance or rejection. Once seller is accepted and have completed all the required tasks, seller can explore and advertise his products on the basis of his needs. Seller can perform different metrics checks and ads monitoring like, total number of ad clicks, total cost of ad clicks, search term used by the customer, etc.

Scenario-2:¶

Every year there is Amazon Freedom sale where customers can purchase products with some discounts and all. As this sale happens every year, and we have all pre-historic data where we can compare all the pre-data with the new customer registrations.

Statistics and Probability Uses:¶

1) As sellers are accepted and rejected on the basis of validation, we can use binomial distribution to check acceptance ratio, and how many sellers are successfully onboarded by Amazon in a day. We can calculate the probability percentage which can be used to know how many engineers needs to be deployed for the validation and the onboarding process. 2) Amazon can calculate time consumption for validation of seller, on the basis of which engineer's average time frame can be divided and seller validation can be equally distributed. 3) As seller has visibility to monitor and perform different metrics checks, we can use different concepts to increase the sales:

  • Seller can perform 'Correlation Analysis', on the products that have been viewed or sold and on the basis of this seller can advertise more products via providing discount on the correlated product, ex: i-Phone has been advertised with the correlation of charger, cover and protector, with some discount if customer buys all product in 1 go. Also, seller can perform the joint probability and marginal probability with the contingency table showing all relation of different products and can calculate which product will be best suited as correlation.
  • Seller can perform different test to validate Hypothesis Testing, so sales and profits increase.
  • As per the customer's feedback seller can check customer sentiments and can improve the sales. 4) We can compare 5-10 years historic data for freedom sales with the increase in new customers. As freedom sale for 2022 is going to be start, Amazon can compare the relation between the new users sign-on with actual purchase. On the basis of increase in customer probability we can forecast the load and performance for our servers and can scale as per the need.
#¶

PART-B¶

#¶

1. Read the data set, clean the data and prepare final dataset to be used for analysis.¶

In [14]:
basketball_data = pd.read_csv('Basketball.csv')
basketball_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Team                 61 non-null     object
 1   Tournament           61 non-null     int64 
 2   Score                61 non-null     object
 3   PlayedGames          61 non-null     object
 4   WonGames             61 non-null     object
 5   DrawnGames           61 non-null     object
 6   LostGames            61 non-null     object
 7   BasketScored         61 non-null     object
 8   BasketGiven          61 non-null     object
 9   TournamentChampion   61 non-null     object
 10  Runner-up            61 non-null     object
 11  TeamLaunch           61 non-null     object
 12  HighestPositionHeld  61 non-null     int64 
dtypes: int64(2), object(11)
memory usage: 6.3+ KB

Basketball.csv contains 2 columns as int type and 11 columns as object type Total columns are 13 Total rows are 61

In [15]:
# Checking few data using head()
basketball_data.head()
Out[15]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion Runner-up TeamLaunch HighestPositionHeld
0 Team 1 86 4385 2762 1647 552 563 5947 3140 33 23 1929 1
1 Team 2 86 4262 2762 1581 573 608 5900 3114 25 25 1929 1
2 Team 3 80 3442 2614 1241 598 775 4534 3309 10 8 1929 1
3 Team 4 82 3386 2664 1187 616 861 4398 3469 6 6 1931to32 1
4 Team 5 86 3368 2762 1209 633 920 4631 3700 8 7 1929 1
In [16]:
basketball_data.isnull().sum()
Out[16]:
Team                   0
Tournament             0
Score                  0
PlayedGames            0
WonGames               0
DrawnGames             0
LostGames              0
BasketScored           0
BasketGiven            0
TournamentChampion     0
Runner-up              0
TeamLaunch             0
HighestPositionHeld    0
dtype: int64

There is no empty data, so our basketball dataframe don't need any cleansing for NA/NULL values.

In [17]:
duplicates = basketball_data.duplicated()
sum(duplicates)
Out[17]:
0

No Duplicates found.

As there is no data cleansing required, so will be describing the data

In [18]:
basketball_data.describe(include="all").T
Out[18]:
count unique top freq mean std min 25% 50% 75% max
Team 61 61 Team 1 1 NaN NaN NaN NaN NaN NaN NaN
Tournament 61.0 NaN NaN NaN 24.0 26.827225 1.0 4.0 12.0 38.0 86.0
Score 61 61 4385 1 NaN NaN NaN NaN NaN NaN NaN
PlayedGames 61 53 2762 3 NaN NaN NaN NaN NaN NaN NaN
WonGames 61 59 7 2 NaN NaN NaN NaN NaN NaN NaN
DrawnGames 61 57 14 2 NaN NaN NaN NaN NaN NaN NaN
LostGames 61 56 37 3 NaN NaN NaN NaN NaN NaN NaN
BasketScored 61 60 70 2 NaN NaN NaN NaN NaN NaN NaN
BasketGiven 61 61 3140 1 NaN NaN NaN NaN NaN NaN NaN
TournamentChampion 61 8 - 52 NaN NaN NaN NaN NaN NaN NaN
Runner-up 61 10 - 48 NaN NaN NaN NaN NaN NaN NaN
TeamLaunch 61 47 1929 10 NaN NaN NaN NaN NaN NaN NaN
HighestPositionHeld 61.0 NaN NaN NaN 7.081967 5.276663 1.0 3.0 6.0 10.0 20.0

Using describe():

  • Columns not having numeric dataType. Columns are: Score, PlayedGames, WonGames, DrawnGames, LostGames, BasketScored, BasketGiven, TournamentChampion, Runner-up
In [19]:
basketball_data.columns
Out[19]:
Index(['Team', 'Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames',
       'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion',
       'Runner-up', 'TeamLaunch', 'HighestPositionHeld'],
      dtype='object')

Like all columns are in Upper Camel Case, so lets change Runner-up to the same format as others.

In [20]:
# Checking whole datasets
basketball_data
Out[20]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion Runner-up TeamLaunch HighestPositionHeld
0 Team 1 86 4385 2762 1647 552 563 5947 3140 33 23 1929 1
1 Team 2 86 4262 2762 1581 573 608 5900 3114 25 25 1929 1
2 Team 3 80 3442 2614 1241 598 775 4534 3309 10 8 1929 1
3 Team 4 82 3386 2664 1187 616 861 4398 3469 6 6 1931to32 1
4 Team 5 86 3368 2762 1209 633 920 4631 3700 8 7 1929 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
56 Team 57 1 34 38 8 10 20 38 66 - - 2009-10 20
57 Team 58 1 22 30 7 8 15 37 57 - - 1956-57 16
58 Team 59 1 19 30 7 5 18 51 85 - - 1951~52 16
59 Team 60 1 14 30 5 4 21 34 65 - - 1955-56 15
60 Team 61 1 - - - - - - - - - 2017~18 9

61 rows × 13 columns

While analyzing data got some points:

  • Many columns have values as '-', where we can replace it with 0, as these columns specifies Numeric datatype.
  • While checking TeamLaunch column values, there is uneven data distribution and as per the description provided i.e. "Year the team was launched on professional basketball.", data need to have only year, like some values are present(1929), so we can format this column.
  • Let's add 1 more column regarding the TeamLaunch Categorisation, as in further analysis it will be helpful, so that we can find the which team is old and new with value as: Oldest, Old, New and Newest.
In [21]:
# Renaming column Runner-up to RunnerUp
basketball_data.rename(columns = {'Runner-up': 'RunnerUp'}, inplace = True)
In [22]:
#Let's keep a duplicate copy of this file, as we will be doing some data cleaning and data imputation.
basketball_data_copy = basketball_data.copy(deep=True)
basketball_data_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Team                 61 non-null     object
 1   Tournament           61 non-null     int64 
 2   Score                61 non-null     object
 3   PlayedGames          61 non-null     object
 4   WonGames             61 non-null     object
 5   DrawnGames           61 non-null     object
 6   LostGames            61 non-null     object
 7   BasketScored         61 non-null     object
 8   BasketGiven          61 non-null     object
 9   TournamentChampion   61 non-null     object
 10  RunnerUp             61 non-null     object
 11  TeamLaunch           61 non-null     object
 12  HighestPositionHeld  61 non-null     int64 
dtypes: int64(2), object(11)
memory usage: 6.3+ KB
In [23]:
# Slicing data values of TeamLaunch column
basketball_data['TeamLaunch']=basketball_data['TeamLaunch'].str.slice(0,4)

# Replace all values which is having '-' as value with '0'
basketball_data=basketball_data.replace('-',0)

# Show Data-Set
basketball_data
Out[23]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld
0 Team 1 86 4385 2762 1647 552 563 5947 3140 33 23 1929 1
1 Team 2 86 4262 2762 1581 573 608 5900 3114 25 25 1929 1
2 Team 3 80 3442 2614 1241 598 775 4534 3309 10 8 1929 1
3 Team 4 82 3386 2664 1187 616 861 4398 3469 6 6 1931 1
4 Team 5 86 3368 2762 1209 633 920 4631 3700 8 7 1929 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
56 Team 57 1 34 38 8 10 20 38 66 0 0 2009 20
57 Team 58 1 22 30 7 8 15 37 57 0 0 1956 16
58 Team 59 1 19 30 7 5 18 51 85 0 0 1951 16
59 Team 60 1 14 30 5 4 21 34 65 0 0 1955 15
60 Team 61 1 0 0 0 0 0 0 0 0 0 2017 9

61 rows × 13 columns

Data cleansing and data imputation has been done, lets make all columns dType as numeric except Team column. After inspecting the rows, Team 61 data isn't sufficient, so we can remove index 60.

In [24]:
columns=basketball_data.drop('Team', axis=1).columns
basketball_data[columns]=basketball_data[columns].apply(pd.to_numeric)

# Add new column named TeamLaunchCategorisation
basketball_data['TeamLaunchCategorisation']=pd.qcut(x=basketball_data.TeamLaunch, q=4, labels=['Oldest', 'Old', 'New', 'Newest'])

basketball_data.drop(60, inplace=True)
basketball_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Team                      60 non-null     object  
 1   Tournament                60 non-null     int64   
 2   Score                     60 non-null     int64   
 3   PlayedGames               60 non-null     int64   
 4   WonGames                  60 non-null     int64   
 5   DrawnGames                60 non-null     int64   
 6   LostGames                 60 non-null     int64   
 7   BasketScored              60 non-null     int64   
 8   BasketGiven               60 non-null     int64   
 9   TournamentChampion        60 non-null     int64   
 10  RunnerUp                  60 non-null     int64   
 11  TeamLaunch                60 non-null     int64   
 12  HighestPositionHeld       60 non-null     int64   
 13  TeamLaunchCategorisation  60 non-null     category
dtypes: category(1), int64(12), object(1)
memory usage: 6.5+ KB
In [25]:
# Let's inspect/analyze data on last time
basketball_data
Out[25]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld TeamLaunchCategorisation
0 Team 1 86 4385 2762 1647 552 563 5947 3140 33 23 1929 1 Oldest
1 Team 2 86 4262 2762 1581 573 608 5900 3114 25 25 1929 1 Oldest
2 Team 3 80 3442 2614 1241 598 775 4534 3309 10 8 1929 1 Oldest
3 Team 4 82 3386 2664 1187 616 861 4398 3469 6 6 1931 1 Oldest
4 Team 5 86 3368 2762 1209 633 920 4631 3700 8 7 1929 1 Oldest
5 Team 6 73 2819 2408 990 531 887 3680 3373 1 4 1934 1 Oldest
6 Team 7 82 2792 2626 948 608 1070 3609 3889 0 0 1929 3 Oldest
7 Team 8 70 2573 2302 864 577 861 3228 3230 2 3 1929 1 Oldest
8 Team 9 58 2109 1986 698 522 766 2683 2847 0 1 1939 2 Old
9 Team 10 51 1884 1728 606 440 682 2159 2492 1 0 1932 1 Oldest
10 Team 11 45 1814 1530 563 392 575 2052 2188 1 5 1941 1 Old
11 Team 12 51 1789 1698 586 389 723 2278 2624 0 0 1939 4 Old
12 Team 13 42 1471 1466 463 384 619 1767 2180 0 0 1948 4 Old
13 Team 14 44 1416 1428 453 336 639 1843 2368 0 1 1929 2 Oldest
14 Team 15 43 1389 1458 471 358 629 1753 2152 0 1 1944 2 Old
15 Team 16 37 1351 1318 426 327 565 1500 1834 0 0 1935 4 Oldest
16 Team 17 36 1314 1255 390 330 535 1421 1763 0 0 1949 4 Old
17 Team 18 38 1174 1192 408 292 492 1642 1951 0 0 1933 3 Oldest
18 Team 19 27 1148 988 333 256 399 1182 1371 0 0 1960 3 New
19 Team 20 33 1020 1096 367 242 487 1347 1746 0 1 1951 2 Old
20 Team 21 17 970 646 266 172 208 892 789 0 1 1998 2 Newest
21 Team 22 23 667 742 218 175 349 819 1157 0 0 1941 6 Old
22 Team 23 17 662 652 189 148 305 760 1088 0 0 1977 8 New
23 Team 24 21 606 678 203 180 295 750 1022 0 0 1959 5 New
24 Team 25 12 553 456 147 112 197 520 633 0 0 2004 6 Newest
25 Team 26 20 538 628 184 149 295 716 1050 0 0 1935 5 Oldest
26 Team 27 13 510 494 155 128 211 619 744 0 0 1961 5 New
27 Team 28 18 445 586 145 143 298 607 992 0 0 1940 11 Old
28 Team 29 12 421 380 125 81 174 458 623 0 0 1930 6 Oldest
29 Team 30 11 416 402 113 95 194 430 632 0 0 1963 6 New
30 Team 31 12 375 423 123 102 198 422 581 0 0 1974 7 New
31 Team 32 14 353 426 129 95 202 492 720 0 0 1943 4 Old
32 Team 33 12 343 448 104 127 217 393 662 0 0 1977 12 New
33 Team 34 9 293 346 96 92 158 291 489 0 0 1987 7 Newest
34 Team 35 11 285 334 103 79 152 419 588 0 0 1941 4 Old
35 Team 36 7 277 270 76 76 118 320 410 0 0 1991 7 Newest
36 Team 37 6 242 228 62 56 110 244 366 0 0 2007 8 Newest
37 Team 38 9 230 282 82 63 137 285 430 0 0 1962 5 New
38 Team 39 4 190 160 52 45 63 199 241 0 0 1994 10 Newest
39 Team 40 5 188 186 50 46 90 202 296 0 0 1978 8 New
40 Team 41 6 168 204 59 50 95 216 310 0 0 1971 12 New
41 Team 42 6 150 180 53 44 83 165 221 0 0 1963 7 New
42 Team 43 4 148 152 37 37 78 155 253 0 0 1999 17 Newest
43 Team 44 3 132 114 35 27 52 139 167 0 0 2014 10 Newest
44 Team 45 7 107 130 43 21 66 227 308 0 0 1929 3 Oldest
45 Team 46 3 96 114 26 44 44 101 139 0 0 1990 9 Newest
46 Team 47 4 91 116 34 16 66 181 295 0 0 1947 7 Old
47 Team 48 2 83 80 20 23 37 62 117 0 0 1996 17 Newest
48 Team 49 2 81 80 19 24 37 70 115 0 0 1995 19 Newest
49 Team 50 4 76 108 30 16 62 145 252 0 0 1945 10 Old
50 Team 51 3 71 90 29 13 48 121 183 0 0 1953 14 New
51 Team 52 4 56 72 21 14 37 153 184 0 0 1929 6 Oldest
52 Team 53 2 52 68 17 18 33 71 116 0 0 1979 10 Newest
53 Team 54 3 42 54 18 6 30 97 131 0 0 1929 8 Oldest
54 Team 55 2 40 68 13 14 41 70 182 0 0 1950 16 Old
55 Team 56 1 35 38 8 11 19 36 55 0 0 2016 17 Newest
56 Team 57 1 34 38 8 10 20 38 66 0 0 2009 20 Newest
57 Team 58 1 22 30 7 8 15 37 57 0 0 1956 16 New
58 Team 59 1 19 30 7 5 18 51 85 0 0 1951 16 Old
59 Team 60 1 14 30 5 4 21 34 65 0 0 1955 15 New
In [26]:
# Let's describe the data
basketball_data.describe(include='all')
Out[26]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld TeamLaunchCategorisation
count 60 60.000000 60.000000 60.000000 60.000000 60.000000 60.000000 60.000000 60.000000 60.00000 60.000000 60.000000 60.000000 60
unique 60 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4
top Team 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Oldest
freq 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 17
mean NaN 24.383333 916.450000 810.100000 309.033333 192.083333 308.816667 1159.350000 1159.233333 1.45000 1.433333 1957.950000 7.050000 NaN
std NaN 26.884620 1138.342899 877.465393 408.481395 201.985508 294.508639 1512.063948 1163.946914 5.51554 4.574679 26.646732 5.315232 NaN
min NaN 1.000000 14.000000 30.000000 5.000000 4.000000 15.000000 34.000000 55.000000 0.00000 0.000000 1929.000000 1.000000 NaN
25% NaN 4.000000 104.250000 115.500000 34.750000 26.250000 62.750000 154.500000 236.000000 0.00000 0.000000 1934.750000 3.000000 NaN
50% NaN 12.000000 395.500000 424.500000 124.000000 98.500000 197.500000 444.000000 632.500000 0.00000 0.000000 1950.500000 6.000000 NaN
75% NaN 39.000000 1360.500000 1345.500000 432.750000 331.500000 563.500000 1669.750000 2001.250000 0.00000 0.000000 1977.250000 10.000000 NaN
max NaN 86.000000 4385.000000 2762.000000 1647.000000 633.000000 1070.000000 5947.000000 3889.000000 33.00000 25.000000 2016.000000 20.000000 NaN

DataSet has been ready for further analysis. Dataset to be used: basketball_data

2. Perform detailed statistical analysis and EDA using univariate, bi-variate and multivariate EDA techniques to get data driven insights on recommending which teams they can approach which will be a deal win for them. Also as a data and statistics expert you have to develop a detailed performance report using this data.¶

Uni-variate Analysis using EDA¶

Checking the overall distribution of continuous data variables.

In [27]:
columns=basketball_data.drop(['Team','TeamLaunchCategorisation'], axis=1).columns
fig=plt.figure(figsize=(20,10))
for i in range(0,len(columns)):
   ax=fig.add_subplot(3,4,i+1)
   sns.distplot(basketball_data[columns[i]])
   ax.set_title(columns[i],color='orange',weight='bold',fontsize=16)
plt.tight_layout()
plt.show()

Above distplot graph represents that there is skewness in which tends to have tail towards higher values, so this type of skewness is called Right Skewed, that means mean would be greater than median for all values, so it can be termed as both Right Skewed or Positive Skewed.

In [28]:
# Comparison of mean and median to check mean and median fact
basketball_data.mean()>basketball_data.median()
Out[28]:
Tournament             True
Score                  True
PlayedGames            True
WonGames               True
DrawnGames             True
LostGames              True
BasketScored           True
BasketGiven            True
TournamentChampion     True
RunnerUp               True
TeamLaunch             True
HighestPositionHeld    True
dtype: bool

Hence, its proved that dataset's values tends to be right skewed.

As now, we have seen the skewness of graph using distplot, now let's see the 5-Number Summary plot or BoxPlot for each data

In [29]:
columns=basketball_data.drop(['Team','TeamLaunchCategorisation'], axis=1).columns
fig=plt.figure(figsize=(20,10))
for i in range(0,len(columns)):
   ax=fig.add_subplot(3,4,i+1)
   sns.boxplot(basketball_data[columns[i]])
   ax.set_title(columns[i],color='orange',weight='bold',fontsize=16)
plt.tight_layout()
plt.show()

Describing above graphical representation:

  • As we have inspected the data, TournamentChampion and RunnerUP mainly have values as 0 and as per describe() method, Q3 for both columns is 0. That's why in boxplot all values are expected as outliers.
  • Score, WonGames and BasketScored have limited number of outliers and all outliers are on IQR+Q3 side.
  • In a boxplot, when the median(Q2) is closer to the bottom quartile(Q1) of box and the whisker is shorter on lower end of box, then we say that the distribution is Positively Skewed or Right Skewed.

As now, we have seen the skewness of graph distplot and bocplot, now let's see histplot with statistical data for each column

In [30]:
columns=basketball_data.drop(['Team','TeamLaunchCategorisation'], axis=1).columns
fig=plt.figure(figsize=(20,10))
for i in range(0,len(columns)):
   ax=fig.add_subplot(3,4,i+1)
   plt.hist(basketball_data[columns[i]])
   plt.axvline(basketball_data[columns[i]].mean(), c='g')
   plt.axvline(basketball_data[columns[i]].median(), c='r')
   plt.axvline(basketball_data[columns[i]].mode()[0], c='y')
   ax.set_title(columns[i],color='orange',weight='bold',fontsize=16)
plt.tight_layout()
plt.show()

Describing above graphical representation:

  • As we can see mean value for all plots is greater than median value, that's why its Right Skewed
  • In all plots, 1st bin is having main amount of data, that's why mode is represented in 1st bin
In [31]:
plt.figure(figsize=(10,5))
sns.countplot(data=basketball_data, x='TeamLaunchCategorisation')
plt.xlabel('TeamLaunchCategorisation', weight='bold', fontsize=16)
plt.ylabel('Count', weight='bold', fontsize=16)
plt.show()

Describing above graphical representation

  • Oldest category ranges form 1929-1935 and contains 17 teams.
  • Old category ranges form 1939-1951 and contains 15 teams.
  • New category ranges form 1953-1978 and contains 14 teams.
  • Newest category ranges form 1979-2016 and contains 14 teams.

Same can be represented by pie-chart as below

In [32]:
fig = px.pie(basketball_data, values=basketball_data.TeamLaunchCategorisation.value_counts(), names=basketball_data.TeamLaunchCategorisation.value_counts().index.categories,
             color_discrete_sequence=px.colors.sequential.RdBu, title='TeamLaunchCategorisation', hole=0.3)
fig.show()

Bi-variate Analysis using EDA¶

Let's make graphical representation with Team as category and PlayedGames, WonGames, DrawnGames, LostGames as numerical

In [33]:
basketball_data_sorted_by_played_games=basketball_data.sort_values('PlayedGames', ascending=False)
plt.figure(figsize=(35,15))
sns.scatterplot(data=basketball_data_sorted_by_played_games, x='Team', y='PlayedGames', hue='TeamLaunchCategorisation')
plt.xticks(rotation=45, weight='bold', fontsize=12)
plt.xlabel('Team', weight='bold', fontsize=16)
plt.ylabel('PlayedGames', weight='bold', fontsize=16)
for i in range(0, len(basketball_data_sorted_by_played_games.TeamLaunch)):
   plt.annotate(basketball_data_sorted_by_played_games.TeamLaunch[i], (basketball_data_sorted_by_played_games.Team[i], basketball_data_sorted_by_played_games.PlayedGames[i]))
plt.title('Team vs PlayedGames with TeamLaunch and TeamLaunchCategorisation as bifurcation', weight='bold', fontsize=30)
plt.legend(prop={'size': 20})
plt.show()

Describing above graphical representation:

  • Team 1,2,5 are the oldest team which were launched in 1929 and have played maximum games, i.e. 2762
  • Team 54 is the oldest team which was launched in 1929 and have played the lowest matches in the oldest category i.e., 68

To plot graph for win, lost, draw, champion and runnerup games, we have to convert the data values in standard form, i.e., (expectedValue/totalValue)*100

In [34]:
basketball_data['Win']=np.round((basketball_data.WonGames/basketball_data.PlayedGames)*100,2)
basketball_data['Lost']=np.round((basketball_data.LostGames/basketball_data.PlayedGames)*100,2)
basketball_data['Draw']=np.round((basketball_data.DrawnGames/basketball_data.PlayedGames)*100,2)
basketball_data['Champion%']=np.round((basketball_data.TournamentChampion/basketball_data.Tournament)*100,2)
basketball_data['RunnerUp%']=np.round((basketball_data.TournamentChampion/basketball_data.Tournament)*100,2)
basketball_data
Out[34]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld TeamLaunchCategorisation Win Lost Draw Champion% RunnerUp%
0 Team 1 86 4385 2762 1647 552 563 5947 3140 33 23 1929 1 Oldest 59.63 20.38 19.99 38.37 38.37
1 Team 2 86 4262 2762 1581 573 608 5900 3114 25 25 1929 1 Oldest 57.24 22.01 20.75 29.07 29.07
2 Team 3 80 3442 2614 1241 598 775 4534 3309 10 8 1929 1 Oldest 47.48 29.65 22.88 12.50 12.50
3 Team 4 82 3386 2664 1187 616 861 4398 3469 6 6 1931 1 Oldest 44.56 32.32 23.12 7.32 7.32
4 Team 5 86 3368 2762 1209 633 920 4631 3700 8 7 1929 1 Oldest 43.77 33.31 22.92 9.30 9.30
5 Team 6 73 2819 2408 990 531 887 3680 3373 1 4 1934 1 Oldest 41.11 36.84 22.05 1.37 1.37
6 Team 7 82 2792 2626 948 608 1070 3609 3889 0 0 1929 3 Oldest 36.10 40.75 23.15 0.00 0.00
7 Team 8 70 2573 2302 864 577 861 3228 3230 2 3 1929 1 Oldest 37.53 37.40 25.07 2.86 2.86
8 Team 9 58 2109 1986 698 522 766 2683 2847 0 1 1939 2 Old 35.15 38.57 26.28 0.00 0.00
9 Team 10 51 1884 1728 606 440 682 2159 2492 1 0 1932 1 Oldest 35.07 39.47 25.46 1.96 1.96
10 Team 11 45 1814 1530 563 392 575 2052 2188 1 5 1941 1 Old 36.80 37.58 25.62 2.22 2.22
11 Team 12 51 1789 1698 586 389 723 2278 2624 0 0 1939 4 Old 34.51 42.58 22.91 0.00 0.00
12 Team 13 42 1471 1466 463 384 619 1767 2180 0 0 1948 4 Old 31.58 42.22 26.19 0.00 0.00
13 Team 14 44 1416 1428 453 336 639 1843 2368 0 1 1929 2 Oldest 31.72 44.75 23.53 0.00 0.00
14 Team 15 43 1389 1458 471 358 629 1753 2152 0 1 1944 2 Old 32.30 43.14 24.55 0.00 0.00
15 Team 16 37 1351 1318 426 327 565 1500 1834 0 0 1935 4 Oldest 32.32 42.87 24.81 0.00 0.00
16 Team 17 36 1314 1255 390 330 535 1421 1763 0 0 1949 4 Old 31.08 42.63 26.29 0.00 0.00
17 Team 18 38 1174 1192 408 292 492 1642 1951 0 0 1933 3 Oldest 34.23 41.28 24.50 0.00 0.00
18 Team 19 27 1148 988 333 256 399 1182 1371 0 0 1960 3 New 33.70 40.38 25.91 0.00 0.00
19 Team 20 33 1020 1096 367 242 487 1347 1746 0 1 1951 2 Old 33.49 44.43 22.08 0.00 0.00
20 Team 21 17 970 646 266 172 208 892 789 0 1 1998 2 Newest 41.18 32.20 26.63 0.00 0.00
21 Team 22 23 667 742 218 175 349 819 1157 0 0 1941 6 Old 29.38 47.04 23.58 0.00 0.00
22 Team 23 17 662 652 189 148 305 760 1088 0 0 1977 8 New 28.99 46.78 22.70 0.00 0.00
23 Team 24 21 606 678 203 180 295 750 1022 0 0 1959 5 New 29.94 43.51 26.55 0.00 0.00
24 Team 25 12 553 456 147 112 197 520 633 0 0 2004 6 Newest 32.24 43.20 24.56 0.00 0.00
25 Team 26 20 538 628 184 149 295 716 1050 0 0 1935 5 Oldest 29.30 46.97 23.73 0.00 0.00
26 Team 27 13 510 494 155 128 211 619 744 0 0 1961 5 New 31.38 42.71 25.91 0.00 0.00
27 Team 28 18 445 586 145 143 298 607 992 0 0 1940 11 Old 24.74 50.85 24.40 0.00 0.00
28 Team 29 12 421 380 125 81 174 458 623 0 0 1930 6 Oldest 32.89 45.79 21.32 0.00 0.00
29 Team 30 11 416 402 113 95 194 430 632 0 0 1963 6 New 28.11 48.26 23.63 0.00 0.00
30 Team 31 12 375 423 123 102 198 422 581 0 0 1974 7 New 29.08 46.81 24.11 0.00 0.00
31 Team 32 14 353 426 129 95 202 492 720 0 0 1943 4 Old 30.28 47.42 22.30 0.00 0.00
32 Team 33 12 343 448 104 127 217 393 662 0 0 1977 12 New 23.21 48.44 28.35 0.00 0.00
33 Team 34 9 293 346 96 92 158 291 489 0 0 1987 7 Newest 27.75 45.66 26.59 0.00 0.00
34 Team 35 11 285 334 103 79 152 419 588 0 0 1941 4 Old 30.84 45.51 23.65 0.00 0.00
35 Team 36 7 277 270 76 76 118 320 410 0 0 1991 7 Newest 28.15 43.70 28.15 0.00 0.00
36 Team 37 6 242 228 62 56 110 244 366 0 0 2007 8 Newest 27.19 48.25 24.56 0.00 0.00
37 Team 38 9 230 282 82 63 137 285 430 0 0 1962 5 New 29.08 48.58 22.34 0.00 0.00
38 Team 39 4 190 160 52 45 63 199 241 0 0 1994 10 Newest 32.50 39.38 28.12 0.00 0.00
39 Team 40 5 188 186 50 46 90 202 296 0 0 1978 8 New 26.88 48.39 24.73 0.00 0.00
40 Team 41 6 168 204 59 50 95 216 310 0 0 1971 12 New 28.92 46.57 24.51 0.00 0.00
41 Team 42 6 150 180 53 44 83 165 221 0 0 1963 7 New 29.44 46.11 24.44 0.00 0.00
42 Team 43 4 148 152 37 37 78 155 253 0 0 1999 17 Newest 24.34 51.32 24.34 0.00 0.00
43 Team 44 3 132 114 35 27 52 139 167 0 0 2014 10 Newest 30.70 45.61 23.68 0.00 0.00
44 Team 45 7 107 130 43 21 66 227 308 0 0 1929 3 Oldest 33.08 50.77 16.15 0.00 0.00
45 Team 46 3 96 114 26 44 44 101 139 0 0 1990 9 Newest 22.81 38.60 38.60 0.00 0.00
46 Team 47 4 91 116 34 16 66 181 295 0 0 1947 7 Old 29.31 56.90 13.79 0.00 0.00
47 Team 48 2 83 80 20 23 37 62 117 0 0 1996 17 Newest 25.00 46.25 28.75 0.00 0.00
48 Team 49 2 81 80 19 24 37 70 115 0 0 1995 19 Newest 23.75 46.25 30.00 0.00 0.00
49 Team 50 4 76 108 30 16 62 145 252 0 0 1945 10 Old 27.78 57.41 14.81 0.00 0.00
50 Team 51 3 71 90 29 13 48 121 183 0 0 1953 14 New 32.22 53.33 14.44 0.00 0.00
51 Team 52 4 56 72 21 14 37 153 184 0 0 1929 6 Oldest 29.17 51.39 19.44 0.00 0.00
52 Team 53 2 52 68 17 18 33 71 116 0 0 1979 10 Newest 25.00 48.53 26.47 0.00 0.00
53 Team 54 3 42 54 18 6 30 97 131 0 0 1929 8 Oldest 33.33 55.56 11.11 0.00 0.00
54 Team 55 2 40 68 13 14 41 70 182 0 0 1950 16 Old 19.12 60.29 20.59 0.00 0.00
55 Team 56 1 35 38 8 11 19 36 55 0 0 2016 17 Newest 21.05 50.00 28.95 0.00 0.00
56 Team 57 1 34 38 8 10 20 38 66 0 0 2009 20 Newest 21.05 52.63 26.32 0.00 0.00
57 Team 58 1 22 30 7 8 15 37 57 0 0 1956 16 New 23.33 50.00 26.67 0.00 0.00
58 Team 59 1 19 30 7 5 18 51 85 0 0 1951 16 Old 23.33 60.00 16.67 0.00 0.00
59 Team 60 1 14 30 5 4 21 34 65 0 0 1955 15 New 16.67 70.00 13.33 0.00 0.00
In [35]:
basketball_data_sorted_by_won_games=basketball_data.sort_values('Win', ascending=False)
plt.figure(figsize=(35,15))
sns.scatterplot(data=basketball_data_sorted_by_won_games, x='Team', y='Win', hue='TeamLaunchCategorisation')
plt.xticks(rotation=45, weight='bold', fontsize=12)
plt.xlabel('Team', weight='bold', fontsize=16)
plt.ylabel('WonGames %', weight='bold', fontsize=16)
for i in range(0, len(basketball_data_sorted_by_won_games.TeamLaunch)):
   plt.annotate(basketball_data_sorted_by_won_games.TeamLaunch[i], (basketball_data_sorted_by_won_games.Team[i], basketball_data_sorted_by_won_games.Win[i]))
plt.title('Team vs WonGames % with TeamLaunch and TeamLaunchCategorisation as bifurcation', weight='bold', fontsize=30)
plt.legend(prop={'size': 20})
plt.show()

Describing above graphical representation:

  • Team 1 has won maximum games and is the oldest team of 1929 with 59.63% of games win.
  • Team 60 has won the least games 16.67%.
In [36]:
basketball_data_sorted_by_lost_games=basketball_data.sort_values('Lost', ascending=False)
plt.figure(figsize=(35,15))
sns.scatterplot(data=basketball_data_sorted_by_lost_games, x='Team', y='Lost', hue='TeamLaunchCategorisation')
plt.xticks(rotation=45, weight='bold', fontsize=12)
plt.xlabel('Team', weight='bold', fontsize=16)
plt.ylabel('LostGames %', weight='bold', fontsize=16)
for i in range(0, len(basketball_data_sorted_by_lost_games.TeamLaunch)):
   plt.annotate(basketball_data_sorted_by_lost_games.TeamLaunch[i], (basketball_data_sorted_by_lost_games.Team[i], basketball_data_sorted_by_lost_games.Lost[i]))
plt.title('Team vs LostGames % with TeamLaunch and TeamLaunchCategorisation as bifurcation', weight='bold', fontsize=30)
plt.legend(prop={'size': 20})
plt.show()

Describing above graphical representation:

  • Team 60 has lost maximum games and is the new team from 1955 with 70% of games loose.
  • Team 1 has lost the least games 20.38%.
In [37]:
basketball_data_sorted_by_drawn_games=basketball_data.sort_values('Draw', ascending=False)
plt.figure(figsize=(35,15))
sns.scatterplot(data=basketball_data_sorted_by_drawn_games, x='Team', y='Draw', hue='TeamLaunchCategorisation')
plt.xticks(rotation=45, weight='bold', fontsize=12)
plt.xlabel('Team', weight='bold', fontsize=16)
plt.ylabel('DrawnGames %', weight='bold', fontsize=16)
for i in range(0, len(basketball_data_sorted_by_drawn_games.TeamLaunch)):
   plt.annotate(basketball_data_sorted_by_drawn_games.TeamLaunch[i], (basketball_data_sorted_by_drawn_games.Team[i], basketball_data_sorted_by_drawn_games.Draw[i]))
plt.title('Team vs DrawnGames % with TeamLaunch and TeamLaunchCategorisation as bifurcation', weight='bold', fontsize=30)
plt.legend(prop={'size': 20})
plt.show()

Describing above graphical representation:

  • Team 46 has maximum games as draw and is the newest team from 1990 with 38.60% of games drawn.
  • Team 54 has the least drawn games 11.11%.
In [38]:
# Graphical Representation for Team Category and Games Played
basketball_data_team_categorisation=basketball_data.sort_values('PlayedGames').groupby('TeamLaunchCategorisation')['PlayedGames'].sum().reset_index()
plt.figure(figsize=(15,5))
sns.barplot(x ='TeamLaunchCategorisation',y='PlayedGames',data = basketball_data_team_categorisation)
plt.xlabel('TeamLaunchCategorisation', weight='bold', fontsize=16)
plt.ylabel('PlayedGames', weight='bold', fontsize=16)
plt.title("PlayedGames across Team Category",size=15)
plt.show()

Describing above graphical representation:

  • Oldest category have played 27,830 matches.
  • Old category have played 12,899 matches.
  • New category have played 5,087 matches.
  • Newest category have played 2,790 matches
In [39]:
basketball_data
Out[39]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld TeamLaunchCategorisation Win Lost Draw Champion% RunnerUp%
0 Team 1 86 4385 2762 1647 552 563 5947 3140 33 23 1929 1 Oldest 59.63 20.38 19.99 38.37 38.37
1 Team 2 86 4262 2762 1581 573 608 5900 3114 25 25 1929 1 Oldest 57.24 22.01 20.75 29.07 29.07
2 Team 3 80 3442 2614 1241 598 775 4534 3309 10 8 1929 1 Oldest 47.48 29.65 22.88 12.50 12.50
3 Team 4 82 3386 2664 1187 616 861 4398 3469 6 6 1931 1 Oldest 44.56 32.32 23.12 7.32 7.32
4 Team 5 86 3368 2762 1209 633 920 4631 3700 8 7 1929 1 Oldest 43.77 33.31 22.92 9.30 9.30
5 Team 6 73 2819 2408 990 531 887 3680 3373 1 4 1934 1 Oldest 41.11 36.84 22.05 1.37 1.37
6 Team 7 82 2792 2626 948 608 1070 3609 3889 0 0 1929 3 Oldest 36.10 40.75 23.15 0.00 0.00
7 Team 8 70 2573 2302 864 577 861 3228 3230 2 3 1929 1 Oldest 37.53 37.40 25.07 2.86 2.86
8 Team 9 58 2109 1986 698 522 766 2683 2847 0 1 1939 2 Old 35.15 38.57 26.28 0.00 0.00
9 Team 10 51 1884 1728 606 440 682 2159 2492 1 0 1932 1 Oldest 35.07 39.47 25.46 1.96 1.96
10 Team 11 45 1814 1530 563 392 575 2052 2188 1 5 1941 1 Old 36.80 37.58 25.62 2.22 2.22
11 Team 12 51 1789 1698 586 389 723 2278 2624 0 0 1939 4 Old 34.51 42.58 22.91 0.00 0.00
12 Team 13 42 1471 1466 463 384 619 1767 2180 0 0 1948 4 Old 31.58 42.22 26.19 0.00 0.00
13 Team 14 44 1416 1428 453 336 639 1843 2368 0 1 1929 2 Oldest 31.72 44.75 23.53 0.00 0.00
14 Team 15 43 1389 1458 471 358 629 1753 2152 0 1 1944 2 Old 32.30 43.14 24.55 0.00 0.00
15 Team 16 37 1351 1318 426 327 565 1500 1834 0 0 1935 4 Oldest 32.32 42.87 24.81 0.00 0.00
16 Team 17 36 1314 1255 390 330 535 1421 1763 0 0 1949 4 Old 31.08 42.63 26.29 0.00 0.00
17 Team 18 38 1174 1192 408 292 492 1642 1951 0 0 1933 3 Oldest 34.23 41.28 24.50 0.00 0.00
18 Team 19 27 1148 988 333 256 399 1182 1371 0 0 1960 3 New 33.70 40.38 25.91 0.00 0.00
19 Team 20 33 1020 1096 367 242 487 1347 1746 0 1 1951 2 Old 33.49 44.43 22.08 0.00 0.00
20 Team 21 17 970 646 266 172 208 892 789 0 1 1998 2 Newest 41.18 32.20 26.63 0.00 0.00
21 Team 22 23 667 742 218 175 349 819 1157 0 0 1941 6 Old 29.38 47.04 23.58 0.00 0.00
22 Team 23 17 662 652 189 148 305 760 1088 0 0 1977 8 New 28.99 46.78 22.70 0.00 0.00
23 Team 24 21 606 678 203 180 295 750 1022 0 0 1959 5 New 29.94 43.51 26.55 0.00 0.00
24 Team 25 12 553 456 147 112 197 520 633 0 0 2004 6 Newest 32.24 43.20 24.56 0.00 0.00
25 Team 26 20 538 628 184 149 295 716 1050 0 0 1935 5 Oldest 29.30 46.97 23.73 0.00 0.00
26 Team 27 13 510 494 155 128 211 619 744 0 0 1961 5 New 31.38 42.71 25.91 0.00 0.00
27 Team 28 18 445 586 145 143 298 607 992 0 0 1940 11 Old 24.74 50.85 24.40 0.00 0.00
28 Team 29 12 421 380 125 81 174 458 623 0 0 1930 6 Oldest 32.89 45.79 21.32 0.00 0.00
29 Team 30 11 416 402 113 95 194 430 632 0 0 1963 6 New 28.11 48.26 23.63 0.00 0.00
30 Team 31 12 375 423 123 102 198 422 581 0 0 1974 7 New 29.08 46.81 24.11 0.00 0.00
31 Team 32 14 353 426 129 95 202 492 720 0 0 1943 4 Old 30.28 47.42 22.30 0.00 0.00
32 Team 33 12 343 448 104 127 217 393 662 0 0 1977 12 New 23.21 48.44 28.35 0.00 0.00
33 Team 34 9 293 346 96 92 158 291 489 0 0 1987 7 Newest 27.75 45.66 26.59 0.00 0.00
34 Team 35 11 285 334 103 79 152 419 588 0 0 1941 4 Old 30.84 45.51 23.65 0.00 0.00
35 Team 36 7 277 270 76 76 118 320 410 0 0 1991 7 Newest 28.15 43.70 28.15 0.00 0.00
36 Team 37 6 242 228 62 56 110 244 366 0 0 2007 8 Newest 27.19 48.25 24.56 0.00 0.00
37 Team 38 9 230 282 82 63 137 285 430 0 0 1962 5 New 29.08 48.58 22.34 0.00 0.00
38 Team 39 4 190 160 52 45 63 199 241 0 0 1994 10 Newest 32.50 39.38 28.12 0.00 0.00
39 Team 40 5 188 186 50 46 90 202 296 0 0 1978 8 New 26.88 48.39 24.73 0.00 0.00
40 Team 41 6 168 204 59 50 95 216 310 0 0 1971 12 New 28.92 46.57 24.51 0.00 0.00
41 Team 42 6 150 180 53 44 83 165 221 0 0 1963 7 New 29.44 46.11 24.44 0.00 0.00
42 Team 43 4 148 152 37 37 78 155 253 0 0 1999 17 Newest 24.34 51.32 24.34 0.00 0.00
43 Team 44 3 132 114 35 27 52 139 167 0 0 2014 10 Newest 30.70 45.61 23.68 0.00 0.00
44 Team 45 7 107 130 43 21 66 227 308 0 0 1929 3 Oldest 33.08 50.77 16.15 0.00 0.00
45 Team 46 3 96 114 26 44 44 101 139 0 0 1990 9 Newest 22.81 38.60 38.60 0.00 0.00
46 Team 47 4 91 116 34 16 66 181 295 0 0 1947 7 Old 29.31 56.90 13.79 0.00 0.00
47 Team 48 2 83 80 20 23 37 62 117 0 0 1996 17 Newest 25.00 46.25 28.75 0.00 0.00
48 Team 49 2 81 80 19 24 37 70 115 0 0 1995 19 Newest 23.75 46.25 30.00 0.00 0.00
49 Team 50 4 76 108 30 16 62 145 252 0 0 1945 10 Old 27.78 57.41 14.81 0.00 0.00
50 Team 51 3 71 90 29 13 48 121 183 0 0 1953 14 New 32.22 53.33 14.44 0.00 0.00
51 Team 52 4 56 72 21 14 37 153 184 0 0 1929 6 Oldest 29.17 51.39 19.44 0.00 0.00
52 Team 53 2 52 68 17 18 33 71 116 0 0 1979 10 Newest 25.00 48.53 26.47 0.00 0.00
53 Team 54 3 42 54 18 6 30 97 131 0 0 1929 8 Oldest 33.33 55.56 11.11 0.00 0.00
54 Team 55 2 40 68 13 14 41 70 182 0 0 1950 16 Old 19.12 60.29 20.59 0.00 0.00
55 Team 56 1 35 38 8 11 19 36 55 0 0 2016 17 Newest 21.05 50.00 28.95 0.00 0.00
56 Team 57 1 34 38 8 10 20 38 66 0 0 2009 20 Newest 21.05 52.63 26.32 0.00 0.00
57 Team 58 1 22 30 7 8 15 37 57 0 0 1956 16 New 23.33 50.00 26.67 0.00 0.00
58 Team 59 1 19 30 7 5 18 51 85 0 0 1951 16 Old 23.33 60.00 16.67 0.00 0.00
59 Team 60 1 14 30 5 4 21 34 65 0 0 1955 15 New 16.67 70.00 13.33 0.00 0.00
In [40]:
columns=['Win','Draw','Lost','Champion%','RunnerUp%','HighestPositionHeld']
basketball_data_sorted_by_team_launch=basketball_data.sort_values('TeamLaunch')
fig=plt.figure(figsize=(30,20))
for i in range(0,len(columns)):
   ax=fig.add_subplot(6,1,i+1)
   sns.lineplot(data=basketball_data_sorted_by_team_launch, x='Team', y=columns[i], hue='TeamLaunchCategorisation', marker='D')
   ax.set_title('Performance Report of team for '+columns[i] ,color='orange',weight='bold',fontsize=16)
   plt.xlabel('Team', weight='bold', fontsize=12)
   plt.ylabel(columns[i], weight='bold', fontsize=12)
   plt.xticks(rotation=45)
   plt.legend(loc=0)
plt.tight_layout()
plt.show()

Team performance has been defined in above graphical representation where we can see team specific win, loose, draw and more datas

In [41]:
basketball_data_group_lost_mean=basketball_data.groupby(['TeamLaunchCategorisation'])['Lost'].mean().reset_index()
basketball_data_group_draw_mean=basketball_data.groupby(['TeamLaunchCategorisation'])['Draw'].mean().reset_index()
basketball_data_group_win_mean=basketball_data.groupby(['TeamLaunchCategorisation'])['Win'].mean().reset_index()
basketball_data_group_champ_mean=basketball_data.groupby(['TeamLaunchCategorisation'])['Champion%'].mean().reset_index()
basketball_data_group_runner_mean=basketball_data.groupby(['TeamLaunchCategorisation'])['RunnerUp%'].mean().reset_index()
basketball_data_group_pos_mean=basketball_data.groupby(['TeamLaunchCategorisation'])['HighestPositionHeld'].mean().reset_index()

data_list=[basketball_data_group_win_mean,basketball_data_group_lost_mean,basketball_data_group_draw_mean, basketball_data_group_champ_mean, basketball_data_group_runner_mean, basketball_data_group_pos_mean]
columns=['Win','Lost','Draw','Champion%','RunnerUp%','HighestPositionHeld']
fig=plt.figure(figsize=(15,15))
for i in range(0,len(data_list)):
   ax=fig.add_subplot(6,1,i+1)
   sns.lineplot(data=data_list[i], x='TeamLaunchCategorisation', y=columns[i], marker='H')
   ax.set_title('TeamLaunchCategorisation Avg '+columns[i] ,color='orange',weight='bold',fontsize=16)
   plt.xlabel('Team', weight='bold', fontsize=12)
   plt.ylabel(columns[i], weight='bold', fontsize=12)
plt.tight_layout()
plt.show()

Describing above graphical representation

  • Success game rate form the oldest category to the newest category, graph goes decreasing. Oldest category team has kept there performance in good pace, other category have to spike their success rate.
  • Lost game rate for old and new category seems to be in bad shape then the newest. As usual oldest category is in good shape.
  • Draw game rate from the oldest category to the newest category, graph goes increasing. Newest Category are able to compete in tournament as head-on but not able to win, need to bring good strategy to win.
  • Champions and RunnerUps are from oldest team only.

Multi-variate Analysis using EDA¶

In [42]:
plt.figure(figsize=(15,7))
sns.pairplot(basketball_data[['Team', 'Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames', 'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion', 'RunnerUp', 'TeamLaunch', 'HighestPositionHeld']], corner=True)
plt.show();
<Figure size 1080x504 with 0 Axes>

Describing above graphical representation

  • Score, WonGames, BasketScored and PlayedGames,Tournament are purely positively correlated, as all datapoints are in same direction and same line
In [43]:
plt.figure(figsize=(15,7))
sns.heatmap(basketball_data[['Team', 'Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames', 'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion', 'RunnerUp', 'TeamLaunch', 'HighestPositionHeld']].corr(), fmt='.2f',annot=True)
plt.show();

Describing above graphical representation

  • Score, WonGames and BasketScored have 1 as correlation value, which means highly positive correlation.
  • PlayedGames and Tournament have 1 as correlation value, which means highly positive correlation.
  • TeamLaunch is negatively correlated with all columns except HighestPositionHeld, which means old and oldest category have ample data then new and newest category

Consolidating summary¶

  • Team 1 have won the highest tournament champion(38.37%) with performance as 59.63% as win, 20.38% as lost and 19.99% as drawn.
  • Team 60 has not won any tournament champion(0%) with performance as 16.67% as win, 70% as lost and 13.33% as drawn.
  • Total 10 teams are from the oldest team of 1929.Teams are Team 1,2,3,5,7,8,14,45,52,54.
  • Team 1 has scored the highest basket i.e., 5,947 and Team 60 has scored the lowest basket i.e., 34
  • Team 7 has given the highest basket i.e., 3889 and Team 56 has given the lowest basket i.e., 55

3. Please include any improvements or suggestions to the association management on quality, quantity, variety, velocity, veracity etc. on the data points collected by the association to perform a better data analysis in future. At-least 1 suggestion for each point.¶

  1. New teams don't have much data like Team 61 was drop due to insufficient data.
  2. We can include more columns for much further analysis, like top player of the Team, yearwise team performance, yearly injuries and more
  3. As HighestPositionHeld is given, but we can have yearly HighestPositionHeld, which can makes better analysis for team.
  4. Yearly number of players transfer can make more precision for team stability.
  5. Team yearly total budget, spent budget can be added which can make analysis for which player we can buy.
  6. Can add column for coach for team, so that we can find, under which coach team's success rate is high.
  7. We can have pointers data, for every team, so that we can get to know the goal scored accuracy with which pointer(1,2,3).
  8. We can have assist ratio for scoring, so that we can analyse that which player is better at scoring and kind develop the formation as per that.
  9. We can have sponsors data also, so that we are able to know the team popularity and finance.
  10. Data with the updated and the latest records, which have all records of new and old teams and who are all currently active and participating in tournaments.
#¶

PART-C¶

#¶

1. Read the CSV file.¶

In [44]:
company=pd.read_csv('CompanyX_EU.csv')
company.head()
Out[44]:
Startup Product Funding Event Result OperatingState
0 2600Hz 2600hz.com NaN Disrupt SF 2013 Contestant Operating
1 3DLT 3dlt.com $630K Disrupt NYC 2013 Contestant Closed
2 3DPrinterOS 3dprinteros.com NaN Disrupt SF 2016 Contestant Operating
3 3Dprintler 3dprintler.com $1M Disrupt NY 2016 Audience choice Operating
4 42 Technologies 42technologies.com NaN Disrupt NYC 2013 Contestant Operating

2. Data Exploration:¶

A. Check the datatypes of each attribute.¶

In [45]:
company.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 662 entries, 0 to 661
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Startup         662 non-null    object
 1   Product         656 non-null    object
 2   Funding         448 non-null    object
 3   Event           662 non-null    object
 4   Result          662 non-null    object
 5   OperatingState  662 non-null    object
dtypes: object(6)
memory usage: 31.2+ KB

6 columns are there with Dtype as object 662 entires are present

B. Check for null values in the attributes.¶

In [46]:
company.isnull().sum()
Out[46]:
Startup             0
Product             6
Funding           214
Event               0
Result              0
OperatingState      0
dtype: int64

Total 220 null values are present, for which Funding has 214 as null and Product has 6 as null.

3. Data preprocessing & visualisation:¶

A. Drop the null values.¶

In [47]:
company_wo_null=company.dropna()
company_wo_null.reset_index(drop=True, inplace=True)
company_wo_null.isnull().sum()
Out[47]:
Startup           0
Product           0
Funding           0
Event             0
Result            0
OperatingState    0
dtype: int64

All null values are dropped.

B. Convert the ‘Funding’ features to a numerical value.¶

In [48]:
company_wo_null.loc[:,'FundsInMillion']=company_wo_null['Funding'].apply(lambda x: float(x[1:-1])/1000 if x[-1] == 'K' else (float(x[1:-1])*1000 if x[-1] == 'B' else float(x[1:-1])))
company_wo_null
Out[48]:
Startup Product Funding Event Result OperatingState FundsInMillion
0 3DLT 3dlt.com $630K Disrupt NYC 2013 Contestant Closed 0.63
1 3Dprintler 3dprintler.com $1M Disrupt NY 2016 Audience choice Operating 1.00
2 5to1 5to1.com $19.3M TC50 2009 Contestant Acquired 19.30
3 8 Securities 8securities.com $29M Disrupt Beijing 2011 Finalist Operating 29.00
4 AdhereTech adheretech.com $1.8M Hardware Battlefield 2014 Contestant Operating 1.80
... ... ... ... ... ... ... ...
441 Zivity zivity.com $8M TC40 2007 Contestant Operating 8.00
442 Zmorph zmorph3d.com $1M - Audience choice Operating 1.00
443 Zocdoc zocdoc.com $223M TC40 2007 Contestant Operating 223.00
444 Zula zulaapp.com $3.4M Disrupt SF 2013 Audience choice Operating 3.40
445 Zumper zumper.com $31.5M Disrupt SF 2012 Finalist Operating 31.50

446 rows × 7 columns

In [49]:
company_wo_null.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 446 entries, 0 to 445
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Startup         446 non-null    object 
 1   Product         446 non-null    object 
 2   Funding         446 non-null    object 
 3   Event           446 non-null    object 
 4   Result          446 non-null    object 
 5   OperatingState  446 non-null    object 
 6   FundsInMillion  446 non-null    float64
dtypes: float64(1), object(6)
memory usage: 24.5+ KB

FundsInMillion column has been added with cleansed values and data type as numeric.

C. Plot box plot for funds in million.¶

In [50]:
box_graph=px.box(company_wo_null, x='FundsInMillion')
box_graph.show()

There are ample number of outliers present in dataset for funds

D. Check the number of outliers greater than the upper fence.¶

In [51]:
q1=np.quantile(company_wo_null.FundsInMillion, 0.25)
q3=np.quantile(company_wo_null.FundsInMillion, 0.75)
cut_off=1.5*(q3-q1)
right_whisker= q3 + 1.5*(q3-q1)
company_wo_null[company_wo_null.FundsInMillion>right_whisker].FundsInMillion.count()
Out[51]:
60

Total 60 values are present in FundsInMillion which are outliers.

In [52]:
company_wo_null.FundsInMillion.describe()
Out[52]:
count     446.000000
mean       17.241489
std        90.483710
min         0.005000
25%         0.745250
50%         2.200000
75%         9.475000
max      1700.000000
Name: FundsInMillion, dtype: float64

To remove outliers, will be using IQR formula, as there are extreme outliers present. Q3 is 9.475 and max is 1700, which is very huge difference, so won't be using z-score.

In [53]:
#Replace every outlier on the upper side by the upper whisker
company_iqr=company_wo_null.copy(deep=True)
for i in np.where(company_iqr.FundsInMillion > q3 + cut_off)[0]:

    whisker  = q3 + cut_off
    company_iqr.loc[i,'FundsInMillion'] = whisker

company_iqr
Out[53]:
Startup Product Funding Event Result OperatingState FundsInMillion
0 3DLT 3dlt.com $630K Disrupt NYC 2013 Contestant Closed 0.630000
1 3Dprintler 3dprintler.com $1M Disrupt NY 2016 Audience choice Operating 1.000000
2 5to1 5to1.com $19.3M TC50 2009 Contestant Acquired 19.300000
3 8 Securities 8securities.com $29M Disrupt Beijing 2011 Finalist Operating 22.569625
4 AdhereTech adheretech.com $1.8M Hardware Battlefield 2014 Contestant Operating 1.800000
... ... ... ... ... ... ... ...
441 Zivity zivity.com $8M TC40 2007 Contestant Operating 8.000000
442 Zmorph zmorph3d.com $1M - Audience choice Operating 1.000000
443 Zocdoc zocdoc.com $223M TC40 2007 Contestant Operating 22.569625
444 Zula zulaapp.com $3.4M Disrupt SF 2013 Audience choice Operating 3.400000
445 Zumper zumper.com $31.5M Disrupt SF 2012 Finalist Operating 22.569625

446 rows × 7 columns

In [54]:
box_graph=px.box(company_iqr, x='FundsInMillion')
box_graph.show()

All outliers have been removed from FundsInMillion.

E. Check frequency of the OperatingState features classes.¶

In [55]:
plt.figure(figsize=(15,7))
sns.countplot(data=company_iqr, x='OperatingState')
plt.xlabel('OperatingState', weight='bold', fontsize=12)
plt.show()

Describing above graphical representation

  • There are 319 where status is as Operating.
  • There are 66 where status is as Acquired.
  • There are 57 where status is as Closed.
  • There are 4 where status is as Ipo.

4. Statistical Analysis:¶

A. Is there any significant difference between Funds raised by companies that are still operating vs companies that closed down?¶

B. Write the null hypothesis and alternative hypothesis.¶

C. Test for significance and conclusion¶

In [56]:
company_iqr_group_operating_state=company_iqr.groupby(['OperatingState'])['FundsInMillion'].sum().reset_index()
hist_graph=px.histogram(company_iqr_group_operating_state, x='OperatingState', y='FundsInMillion',title='OperatingState vs FundsRaised')
hist_graph.update_layout(xaxis_title="OperatingState", yaxis_title="FundsInMillion", title_x=0.5)
hist_graph.show()

Company whose status is operating has highest fundsRaised.

Step-1: Null Hypothesis: Ho: There is no significant difference between Funds raised by companies that are still operating vs companies that closed down. Alternative Hypothesis: Ha: There is significant difference between Funds raised by companies that are still operating vs companies that closed down.

Step-2: Level of Significance, alpha =0.05(default value)

Step-3: As 2 independent samples are present for FundsRaised for operating and closed, so we can use 2 sample T test.

In [57]:
# Step-4: Calculate p_val
t_stat, p_val=stats.ttest_ind(company_iqr[company_iqr.OperatingState=='Operating']['FundsInMillion'], company_iqr[company_iqr.OperatingState=='Closed']['FundsInMillion'], equal_var=False)
t_stat, p_val
Out[57]:
(4.039228118032694, 9.67724782248064e-05)

Step-5: As p_val(9.677e-05) is smaller than level of significant(0.05), we can reject the null hypothesis. Yes, there is significant difference between Funds raised by companies that are still operating vs companies that closed down.

D. Make a copy of the original data frame¶

In [58]:
company_copy=company.copy(deep=True)
company_copy
Out[58]:
Startup Product Funding Event Result OperatingState
0 2600Hz 2600hz.com NaN Disrupt SF 2013 Contestant Operating
1 3DLT 3dlt.com $630K Disrupt NYC 2013 Contestant Closed
2 3DPrinterOS 3dprinteros.com NaN Disrupt SF 2016 Contestant Operating
3 3Dprintler 3dprintler.com $1M Disrupt NY 2016 Audience choice Operating
4 42 Technologies 42technologies.com NaN Disrupt NYC 2013 Contestant Operating
... ... ... ... ... ... ...
657 Zivity zivity.com $8M TC40 2007 Contestant Operating
658 Zmorph zmorph3d.com $1M - Audience choice Operating
659 Zocdoc zocdoc.com $223M TC40 2007 Contestant Operating
660 Zula zulaapp.com $3.4M Disrupt SF 2013 Audience choice Operating
661 Zumper zumper.com $31.5M Disrupt SF 2012 Finalist Operating

662 rows × 6 columns

E. Check frequency distribution of Result variables.¶

In [59]:
plt.figure(figsize=(15,7))
sns.countplot(data=company_copy, x='Result')
plt.xlabel('Result Variables', weight='bold', fontsize=14)
plt.show()

Describing above graphical representation

  • There are 488 where result variable is as Contestant.
  • There are 84 where result variable is as Finalist.
  • There are 41 where result variable is as Audience choice.
  • There are 26 where result variable is as Winner.
  • There are 23 where result variable is as Runner up.

F. Calculate percentage of winners that are still operating and percentage of contestants that are still operating¶

In [60]:
total_operating_count=company_copy.OperatingState.value_counts().get('Operating')
company_copy[(company_copy.Result=='Winner') & (company_copy.OperatingState=='Operating')]['Result'].count()/total_operating_count, company_copy[(company_copy.Result=='Contestant') & (company_copy.OperatingState=='Operating')]['Result'].count()/total_operating_count
Out[60]:
(0.04086021505376344, 0.7139784946236559)

4.08% are winners that are still operating 71.3% are contestants that are still operating

G. Write your hypothesis comparing the proportion of companies that are operating between winners and contestants¶

H. Test for significance and conclusion¶

Step-1 Null Hypothesis, Ho: No difference in proportion of companies that are operating between winners and contestants Alternative Hypothesis, Ha: There is difference in proportion of companies that are operating between winners and contestants

Step-2 Level of significance, alpha=0.05(default)

Step-3 We can use Chi-Square test of independence as these are categorical data. Let's build contigency table

In [61]:
# Creation of contigency table for further checks
contigency_table=pd.crosstab(company_copy.OperatingState, company_copy.Result, margins=True)
contigency_table
Out[61]:
Result Audience choice Contestant Finalist Runner up Winner All
OperatingState
Acquired 0 62 15 2 7 86
Closed 9 90 7 0 0 106
Ipo 0 4 1 0 0 5
Operating 32 332 61 21 19 465
All 41 488 84 23 26 662
In [62]:
# Step-4: Calculate p_val
chi2, p_val, dof, exp_freq=stats.chi2_contingency(contigency_table, correction=False)
chi2, p_val, dof, exp_freq
Out[62]:
(27.904309161964264,
 0.1116839954038626,
 20,
 array([[5.32628399e+00, 6.33957704e+01, 1.09123867e+01, 2.98791541e+00,
         3.37764350e+00, 8.60000000e+01],
        [6.56495468e+00, 7.81389728e+01, 1.34501511e+01, 3.68277946e+00,
         4.16314199e+00, 1.06000000e+02],
        [3.09667674e-01, 3.68580060e+00, 6.34441088e-01, 1.73716012e-01,
         1.96374622e-01, 5.00000000e+00],
        [2.87990937e+01, 3.42779456e+02, 5.90030211e+01, 1.61555891e+01,
         1.82628399e+01, 4.65000000e+02],
        [4.10000000e+01, 4.88000000e+02, 8.40000000e+01, 2.30000000e+01,
         2.60000000e+01, 6.62000000e+02]]))

Step-5: As p_val(0.1116) is greater than level of significant(0.05), we can't reject the null hypothesis. Hence, no difference in proportion of companies that are operating between winners and contestants

I. Select only the Event that has ‘disrupt’ keyword from 2013 onwards.¶

In [63]:
# Getting all year from events for further checks
company_copy_event=company_copy.copy(deep=True)
company_copy_event.loc[:,'EventYear']=company_copy_event.Event.apply(lambda x: x[-5:] if x != '-' else 0).astype(int)
company_copy_event
Out[63]:
Startup Product Funding Event Result OperatingState EventYear
0 2600Hz 2600hz.com NaN Disrupt SF 2013 Contestant Operating 2013
1 3DLT 3dlt.com $630K Disrupt NYC 2013 Contestant Closed 2013
2 3DPrinterOS 3dprinteros.com NaN Disrupt SF 2016 Contestant Operating 2016
3 3Dprintler 3dprintler.com $1M Disrupt NY 2016 Audience choice Operating 2016
4 42 Technologies 42technologies.com NaN Disrupt NYC 2013 Contestant Operating 2013
... ... ... ... ... ... ... ...
657 Zivity zivity.com $8M TC40 2007 Contestant Operating 2007
658 Zmorph zmorph3d.com $1M - Audience choice Operating 0
659 Zocdoc zocdoc.com $223M TC40 2007 Contestant Operating 2007
660 Zula zulaapp.com $3.4M Disrupt SF 2013 Audience choice Operating 2013
661 Zumper zumper.com $31.5M Disrupt SF 2012 Finalist Operating 2012

662 rows × 7 columns

In [64]:
company_copy_event[(company_copy_event.EventYear>=2013) & (company_copy_event.Event.str.contains('Disrupt'))].reset_index()
Out[64]:
index Startup Product Funding Event Result OperatingState EventYear
0 0 2600Hz 2600hz.com NaN Disrupt SF 2013 Contestant Operating 2013
1 1 3DLT 3dlt.com $630K Disrupt NYC 2013 Contestant Closed 2013
2 2 3DPrinterOS 3dprinteros.com NaN Disrupt SF 2016 Contestant Operating 2016
3 3 3Dprintler 3dprintler.com $1M Disrupt NY 2016 Audience choice Operating 2016
4 4 42 Technologies 42technologies.com NaN Disrupt NYC 2013 Contestant Operating 2013
... ... ... ... ... ... ... ... ...
271 646 YayPay Inc yaypay.com $900K Disrupt London 2015 Contestant Operating 2015
272 648 YOOBIC yoobic.com NaN Disrupt London 2015 Finalist Operating 2015
273 653 ZAP! zapreklam.com/ NaN Disrupt EU 2014 Audience choice Operating 2014
274 656 Zenefits zenefits.com $583.6M Disrupt NYC 2013 Finalist Operating 2013
275 660 Zula zulaapp.com $3.4M Disrupt SF 2013 Audience choice Operating 2013

276 rows × 8 columns

276 companies are present having ‘disrupt’ keyword from 2013 onwards.